Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename patched filters so the normalizer uses the default ones #409

Conversation

wgresshoff
Copy link
Contributor

@wgresshoff wgresshoff commented Oct 8, 2024

❤️ Thank you for your contribution!

Close #408

Description

The normalizer that was introduced in the funders index with v5.1.0 generates two tokens when used on documents that contain diacritics in the acronym. This behaviour occurs when in the filter definition "preserve_original": true is used. This is ok for analyzers, so this PR just renames the filters and takes care the analyzers use the new ones and the normailzer uses the original ones.

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Frontend

Reminder

By using GitHub, you have already agreed to the GitHub’s Terms of Service including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.

@wgresshoff wgresshoff force-pushed the bug/408-fix-acronyms-with-diacritics branch from 0e369eb to d3c20ca Compare October 8, 2024 12:51
@wgresshoff wgresshoff force-pushed the bug/408-fix-acronyms-with-diacritics branch from d6f3378 to ae3b4f2 Compare October 11, 2024 08:33
@@ -14,8 +14,8 @@
"type": "custom",
"char_filter": ["strip_special_chars"],
"filter": [
"lowercase",
"asciifolding",
"lowercasepreserveoriginal",
Copy link
Contributor

@kpsherva kpsherva Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgresshoff could I ask you to move all the changes to funder-v3.0.0.json files?
in case of breaking changes like these we need to upgrade the version of the mapping, otherwise instance upgrades are complex
so we need to (for both os-v1 and os-v2)

  • copy the mappings to funder-v3.0.0.json
  • move the changes you proposed here to funder-v3.0.0.json file
  • revert funders-v2.0.0.json changes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since in practice this doesn't break existing indices created with the old mapping but only affects search behavior/accuracy, I feel we can merge without bumping the mapping version.

This means that for deployment, one needs to either:

  • Do the usual recreation of indices (with downtime)
  • Do a live migration with some special handling since the old and new index name will be the same (only the timestamp suffix changes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used the recreation of indices with downtime and the missing funders were reindexed without problem.

@wgresshoff wgresshoff force-pushed the bug/408-fix-acronyms-with-diacritics branch from ae3b4f2 to 6cbca0b Compare October 15, 2024 09:02
@kpsherva kpsherva merged commit dace099 into inveniosoftware:master Oct 16, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: To release 🤖
Development

Successfully merging this pull request may close these issues.

The newly introduced normalizer in the funders vocab generates two tokens when acronyms contain diacritics
4 participants